Model Ensembling and Classification for:
Harry Potter and the Philosopher’s Stone (1997)
Harry Potter and the Chamber of Secrets (1998)
Harry Potter and the Prisoner of Azkaban (1999)
Harry Potter and the Goblet of Fire (2000)
Harry Potter and the Order of the Phoenix (2003)
Harry Potter and the Half-Blood Prince (2005)
Harry Potter and the Deathly Hallows (2007)
Bottom Line Up Front
I intend to answer the question:
Which Harry Potter film is closest to its corresponding book?
Answering this question took 5 steps:
- Define documents
- Data augmentation
- Perform stacked ensemble modeling
- Predict scripts on book model
- Measure via predicted probabilities
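As a rough illustration of the last two steps, the sketch below scores a film by averaging, over its script pages, the probability the book model assigns to the matching book. The function name `closeness_score`, the input layout, and the use of a simple mean are assumptions made for illustration; the exact measure is not spelled out in this summary.

```python
from statistics import mean


def closeness_score(page_probs, book_index):
    """Average probability that a script's pages belong to the matching book.

    page_probs: one list of class probabilities per script page (assumed layout).
    book_index: position of the corresponding book in each probability list.
    """
    return mean(probs[book_index] for probs in page_probs)


# Toy probabilities for a 3-page script over 2 books (made-up numbers).
toy_script = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
score = closeness_score(toy_script, book_index=0)
```

Under this assumption, the film with the highest score would be read as the closest to its book.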
I) Define documents
- Description: Create documents at the page level
- Purpose: Define portions of text that are small enough to provide many examples for the model but large enough to capture meaningful differences in text per book
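A minimal sketch of page-level documents, assuming a "page" is a fixed-size window of words. The helper name `split_into_pages` and the default of 250 words per page are hypothetical; the actual page definition may differ.

```python
def split_into_pages(text, words_per_page=250):
    """Split raw text into consecutive pages of at most words_per_page words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_page])
        for start in range(0, len(words), words_per_page)
    ]


# Tiny example with 2 words per page so the result is easy to inspect.
pages = split_into_pages("one two three four five", words_per_page=2)
```

Each page would then be labeled with the book it came from, giving many training examples per book.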
II) Data augmentation
- Description: Balance classes by oversampling shorter pieces of text drawn from training pages
- Purpose: Enrich training data to improve predictions for shorter books
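One way to picture this step, as a sketch only: for every book with fewer pages than the largest book, draw random contiguous snippets from its training pages until the classes are the same size. The function names, the snippet-length rule (`min_frac`), and the fixed seed are assumptions, not the exact procedure used.

```python
import random
from collections import defaultdict


def sample_snippet(page, rng, min_frac=0.5):
    """Return a random contiguous snippet covering at least min_frac of the page."""
    words = page.split()
    if not words:
        return page
    length = max(1, int(len(words) * rng.uniform(min_frac, 1.0)))
    start = rng.randint(0, len(words) - length)
    return " ".join(words[start:start + length])


def balance_by_oversampling(pages, labels, seed=0):
    """Add snippets to smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for page, label in zip(pages, labels):
        by_label[label].append(page)
    target = max(len(docs) for docs in by_label.values())
    out_pages, out_labels = list(pages), list(labels)
    for label, docs in by_label.items():
        for _ in range(target - len(docs)):
            out_pages.append(sample_snippet(rng.choice(docs), rng))
            out_labels.append(label)
    return out_pages, out_labels


# Toy data: the "long" class has 3 pages, the "short" class has 1.
toy_pages = ["a b c d", "e f g h", "i j k l", "m n o p"]
toy_labels = ["long", "long", "long", "short"]
balanced_pages, balanced_labels = balance_by_oversampling(toy_pages, toy_labels)
```

Only training pages should be oversampled this way, so that the test pages stay untouched.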
III) Structure text for 4 models
- Description: Build 4 document-term matrices (DTMs) using different NLP techniques
- Purpose: Use multiple NLP techniques to take advantage of the strengths of each
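The four techniques are not named here, so the sketch below shows only two common ways to build a DTM, raw term counts and TF-IDF weighting, purely as assumed examples. The tokenizer, the smoothed IDF formula, and all function names are illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_dtm(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf_dtm(docs):
    """Reweight the count DTM by a smoothed inverse document frequency."""
    vocab, counts = count_dtm(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return vocab, [[tf * w for tf, w in zip(row, idf)] for row in counts]


toy_docs = ["the wand", "the owl the owl"]
vocab, counts = count_dtm(toy_docs)
_, weighted = tfidf_dtm(toy_docs)
```

Terms that appear on every page, such as "the" in the toy data, keep a weight of 1 under this IDF, while rarer terms are weighted up.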
IV) Run 4 models
- Description: Run 4 models independently, each with hyperparameter tuning
- Purpose: Optimize each of the 4 models
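The model types and tuning method are not listed here. As a stand-in, the sketch below tunes one hyperparameter (the smoothing value `alpha`) of a small multinomial Naive Bayes classifier with a grid search over a validation split. The classifier, the grid, and every name in the code are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha):
    """Fit a multinomial Naive Bayes model with additive smoothing alpha."""
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return {"alpha": alpha, "word_counts": word_counts,
            "class_counts": Counter(labels), "vocab": vocab}


def predict_nb(model, doc):
    """Return the label with the highest log-posterior for one document."""
    n_docs = sum(model["class_counts"].values())
    scores = {}
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * len(model["vocab"])
        score = math.log(n_label / n_docs)
        for word in doc.lower().split():
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + model["alpha"]) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def grid_search(train, valid, alphas):
    """Pick the alpha with the best accuracy on the validation split."""
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        model = train_nb(train[0], train[1], alpha)
        hits = sum(predict_nb(model, d) == y for d, y in zip(valid[0], valid[1]))
        acc = hits / len(valid[0])
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc


# Toy data standing in for page-level documents (made up for illustration).
train = (["wand spell wand", "owl letter owl", "spell wand", "letter owl"],
         ["a", "b", "a", "b"])
valid = (["wand", "owl"], ["a", "b"])
best_alpha, best_acc = grid_search(train, valid, alphas=[0.1, 0.5, 1.0])
```

The same loop would be repeated for each of the 4 models, each with its own grid of hyperparameters.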
V) Perform stacked ensemble modeling
- Description: Combine the 4 bottom-layer models using a top-layer model
- Purpose: Combine the strengths of each model while minimizing each model’s weaknesses
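The stacking idea can be sketched as follows: each bottom-layer model outputs class probabilities, these are concatenated into one feature row per page, and a top-layer model is trained on those rows. Everything here is a toy assumption: the keyword-based bottom-layer models, the nearest-centroid top layer, and the labels are placeholders, not the models actually used.

```python
from collections import Counter


def keyword_model(keyword):
    """Toy bottom-layer model: returns [P(class 0), P(class 1)] for a document."""
    def predict_proba(doc):
        p = 0.9 if keyword in doc.lower().split() else 0.1
        return [p, 1.0 - p]
    return predict_proba


def meta_features(base_models, doc):
    """Concatenate every bottom-layer model's class probabilities."""
    row = []
    for model in base_models:
        row.extend(model(doc))
    return row


def fit_top_layer(rows, labels):
    """Top-layer model: one mean meta-feature vector (centroid) per class."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        current = sums.get(label, [0.0] * len(row))
        sums[label] = [s + x for s, x in zip(current, row)]
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict_top_layer(centroids, row):
    """Assign the class whose centroid is nearest in squared distance."""
    def dist(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], row))
    return min(centroids, key=dist)


# Two toy bottom-layer models and four toy pages (all made up).
base_models = [keyword_model("basilisk"), keyword_model("diary")]
docs = ["the basilisk struck", "a diary entry", "the dementor glided", "sirius escaped"]
labels = ["book2", "book2", "book3", "book3"]
centroids = fit_top_layer([meta_features(base_models, d) for d in docs], labels)
prediction = predict_top_layer(centroids, meta_features(base_models, "a basilisk in the pipes"))
```

In a real stack, the rows used to fit the top layer would normally come from out-of-fold predictions, so the top-layer model does not learn from pages the bottom-layer models were trained on.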
VI) Determine final model performance
- Description: Evaluate the stacked ensemble on held-out testing data
- Purpose: Ensure the modeling process and its outcome are generalizable
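A small sketch of what checking performance on the testing data could look like, assuming accuracy and per-book recall are the measures of interest. The metric choice and function names are assumptions; no results are implied by the toy labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of test pages assigned to the correct book."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each book, the share of its test pages that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels only, to show the call pattern.
toy_true = ["a", "b", "b"]
toy_pred = ["a", "b", "a"]
overall = accuracy(toy_true, toy_pred)
by_book = per_class_recall(toy_true, toy_pred)
```

Per-book recall is useful here because overall accuracy can hide weak performance on the shorter books.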